FMA: A Dataset For Music Analysis is a dataset consisting of the audio and metadata for all music available in the Free Music Archive. It contains over 100,000 tracks, along with metadata such as album, artist, genre, country of origin, and release date, plus various pre-computed features that are commonly used in audio analysis.
The purpose of this project is to determine whether mel-frequency cepstral coefficients (MFCCs) can be used to determine the genre of a song. For any audio file, MFCCs can be calculated by taking a Short-Time Fourier Transform (STFT) of the time series (moving into the frequency domain), mapping the power spectrum of each window onto the mel scale with a filterbank, taking the logarithm of the resulting filterbank energies, and finally applying a discrete cosine transform, which places the coefficients in the quefrency domain. The mel scale is essentially a conversion that accounts for nonlinearities in how humans perceive pitch: at higher frequencies, equally spaced pitches are perceived as being closer together. For example, a 1000 Hz tone sounds closer to a 900 Hz tone than a 400 Hz tone does to a 300 Hz tone, despite the difference being 100 Hz in both cases. Accordingly, MFCCs can be thought of as compactly describing the spectral envelope of a sound, which for speech corresponds to the shape of a person's vocal tract as they speak or sing. This makes MFCCs a commonly used feature for problems and applications related to speech recognition and speaker recognition (e.g. predicting the accent or gender of a speaker) [1].
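The pipeline above can be sketched end-to-end with plain NumPy. In practice a library such as librosa would be used; all parameter choices below (sample rate, window size, hop, filter and coefficient counts) are arbitrary values for illustration only.

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale conversion
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(signal, sr=8000, n_fft=256, hop=128, n_filters=26, n_coeffs=13):
    # 1) STFT: windowed FFT of overlapping frames
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 2) Map the power spectrum onto the mel scale
    mel_energies = power @ mel_filterbank(n_filters, n_fft, sr).T
    # 3) Log-compress, then 4) DCT-II into the quefrency domain
    log_mel = np.log(mel_energies + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), (2 * n + 1) / (2 * n_filters)))
    return log_mel @ dct.T  # shape: (n_frames, n_coeffs)

t = np.linspace(0, 1, 8000, endpoint=False)
coeffs = mfcc(np.sin(2 * np.pi * 440 * t))  # one second of a 440 Hz tone
print(coeffs.shape)
```

The first few coefficients summarize the coarse spectral shape, which is why keeping only 13–20 of them is common.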
For convenience, the FMA dataset already provides the calculated MFCCs for all of its tracks. Although the dataset is split into different subset sizes to make working with the audio files more manageable, only the data in fma_metadata.zip will be used, since it already contains the MFCC features and the entire dataset's metadata is significantly smaller than the audio files of even the small subset (342 MB vs. 7.2 GB). The particular genres that will be considered for classification will be discussed in the Data Exploration & Pre-Processing section of this notebook.
This section covers all data exploration and pre-processing completed. The structure of this section alternates between data cleanup and data exploration / visualizations to ensure that all data has been prepared properly before moving to model implementations.
Begin by loading the echonest.csv, features.csv, genres.csv, and tracks.csv files and printing a few rows to visualize their structure before cleaning them.
import pandas as pd
#Based on https://stackoverflow.com/a/68620427
echonest = pd.read_csv('https://fwadia.blob.core.windows.net/azureml-blobstore-06b00e51-18b9-4fbb-8ab4-b40dab42308f/UI/08-04-2022_022328_UTC/echonest.csv?sp=r&st=2022-08-05T01:21:50Z&se=2022-08-08T09:21:50Z&spr=https&sv=2021-06-08&sr=b&sig=BFnIwi75Mi9lI8cJ43OVGr8AN3zqiTB8oimIKSZMHvw%3D', low_memory=False) #low_memory=False avoids a DtypeWarning about mixed-type columns
echonest.head(10)
| | Unnamed: 0 | echonest | echonest.1 | echonest.2 | echonest.3 | echonest.4 | echonest.5 | echonest.6 | echonest.7 | echonest.8 | ... | echonest.239 | echonest.240 | echonest.241 | echonest.242 | echonest.243 | echonest.244 | echonest.245 | echonest.246 | echonest.247 | echonest.248 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | audio_features | audio_features | audio_features | audio_features | audio_features | audio_features | audio_features | audio_features | metadata | ... | temporal_features | temporal_features | temporal_features | temporal_features | temporal_features | temporal_features | temporal_features | temporal_features | temporal_features | temporal_features |
| 1 | NaN | acousticness | danceability | energy | instrumentalness | liveness | speechiness | tempo | valence | album_date | ... | 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 |
| 2 | track_id | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2 | 0.4166752327 | 0.6758939853 | 0.6344762684 | 0.0106280683 | 0.1776465712 | 0.1593100648 | 165.9220000000 | 0.5766609880 | NaN | ... | -1.9923025370 | 6.8056936264 | 0.2330697626 | 0.1928800046 | 0.0274549890 | 0.0640799999 | 3.6769599915 | 3.6128799915 | 13.3166904449 | 262.9297485352 |
| 4 | 3 | 0.3744077685 | 0.5286430621 | 0.8174611317 | 0.0018511032 | 0.1058799438 | 0.4618181276 | 126.9570000000 | 0.2692402421 | NaN | ... | -1.5823311806 | 8.8893079758 | 0.2584637702 | 0.2209050059 | 0.0813684240 | 0.0641300008 | 6.0827698708 | 6.0186400414 | 16.6735477448 | 325.5810852051 |
| 5 | 5 | 0.0435668989 | 0.7455658702 | 0.7014699916 | 0.0006967990 | 0.3731433124 | 0.1245953419 | 100.2600000000 | 0.6216612236 | NaN | ... | -2.2883579731 | 11.5271091461 | 0.2568213642 | 0.2378199995 | 0.0601223968 | 0.0601399988 | 5.9264898300 | 5.8663496971 | 16.0138492584 | 356.7557373047 |
| 6 | 10 | 0.9516699648 | 0.6581786543 | 0.9245251615 | 0.9654270154 | 0.1154738842 | 0.0329852191 | 111.5620000000 | 0.9635898919 | 2008-03-11 | ... | -3.6629877090 | 21.5082283020 | 0.2833518982 | 0.2670699954 | 0.1257044971 | 0.0808200017 | 8.4140100479 | 8.3331899643 | 21.3170642853 | 483.4038085938 |
| 7 | 134 | 0.4522173071 | 0.5132380502 | 0.5604099311 | 0.0194426943 | 0.0965666940 | 0.5255193792 | 114.2900000000 | 0.8940722715 | NaN | ... | -1.4526963234 | 2.3563981056 | 0.2346863896 | 0.1995500028 | 0.1493317783 | 0.0644000024 | 11.2670698166 | 11.2026700974 | 26.4541797638 | 751.1477050781 |
| 8 | 139 | 0.1065495253 | 0.2609111726 | 0.6070668636 | 0.8350869898 | 0.2236762711 | 0.0305692764 | 196.9610000000 | 0.1602670903 | NaN | ... | -3.0786671638 | 12.4115667343 | 0.2708015740 | 0.2727000117 | 0.0252420790 | 0.0640399978 | 2.4366900921 | 2.3726501465 | 3.8970954418 | 37.8660430908 |
| 9 | 140 | 0.3763124975 | 0.7340790229 | 0.2656847734 | 0.6695811237 | 0.0859951222 | 0.0390682262 | 107.9520000000 | 0.6099912728 | NaN | ... | -0.9346956015 | -0.2609805167 | 0.3222317100 | 0.2779799998 | 0.1367472708 | 0.0753299966 | 9.8627195358 | 9.7873897552 | 21.9816207886 | 562.2294311523 |
10 rows × 250 columns
features = pd.read_csv('https://fwadia.blob.core.windows.net/azureml-blobstore-06b00e51-18b9-4fbb-8ab4-b40dab42308f/UI/08-04-2022_022328_UTC/features.csv?sp=r&st=2022-08-05T01:22:56Z&se=2022-08-08T09:22:56Z&spr=https&sv=2021-06-08&sr=b&sig=YGgywE7ZJ0X490qBwWdljCaRSj7VzXkik5rjJB1WqVo%3D', low_memory=False) #low_memory=False avoids a DtypeWarning about mixed-type columns
features.head(10)
| | feature | chroma_cens | chroma_cens.1 | chroma_cens.2 | chroma_cens.3 | chroma_cens.4 | chroma_cens.5 | chroma_cens.6 | chroma_cens.7 | chroma_cens.8 | ... | tonnetz.39 | tonnetz.40 | tonnetz.41 | zcr | zcr.1 | zcr.2 | zcr.3 | zcr.4 | zcr.5 | zcr.6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | statistics | kurtosis | kurtosis | kurtosis | kurtosis | kurtosis | kurtosis | kurtosis | kurtosis | kurtosis | ... | std | std | std | kurtosis | max | mean | median | min | skew | std |
| 1 | number | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | ... | 04 | 05 | 06 | 01 | 01 | 01 | 01 | 01 | 01 | 01 |
| 2 | track_id | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2 | 7.1806526184e+00 | 5.2303090096e+00 | 2.4932080507e-01 | 1.3476201296e+00 | 1.4824777842e+00 | 5.3137123585e-01 | 1.4815930128e+00 | 2.6914546490e+00 | 8.6686819792e-01 | ... | 5.4125156254e-02 | 1.2225749902e-02 | 1.2110591866e-02 | 5.7588901520e+00 | 4.5947265625e-01 | 8.5629448295e-02 | 7.1289062500e-02 | 0.0000000000e+00 | 2.0898721218e+00 | 6.1448108405e-02 |
| 4 | 3 | 1.8889633417e+00 | 7.6053929329e-01 | 3.4529656172e-01 | 2.2952005863e+00 | 1.6540306807e+00 | 6.7592434585e-02 | 1.3668476343e+00 | 1.0540937185e+00 | 1.0810308903e-01 | ... | 6.3831120729e-02 | 1.4211839065e-02 | 1.7740072682e-02 | 2.8246941566e+00 | 4.6630859375e-01 | 8.4578499198e-02 | 6.3964843750e-02 | 0.0000000000e+00 | 1.7167237997e+00 | 6.9330163300e-02 |
| 5 | 5 | 5.2756297588e-01 | -7.7654317021e-02 | -2.7961030602e-01 | 6.8588310480e-01 | 1.9375696182e+00 | 8.8083887100e-01 | -9.2319184542e-01 | -9.2723226547e-01 | 6.6661673784e-01 | ... | 4.0730185807e-02 | 1.2690781616e-02 | 1.4759079553e-02 | 6.8084154129e+00 | 3.7500000000e-01 | 5.3114086390e-02 | 4.1503906250e-02 | 0.0000000000e+00 | 2.1933031082e+00 | 4.4860601425e-02 |
| 6 | 10 | 3.7022454739e+00 | -2.9119303823e-01 | 2.1967420578e+00 | -2.3444947600e-01 | 1.3673638105e+00 | 9.9841135740e-01 | 1.7706941366e+00 | 1.6045658588e+00 | 5.2121698856e-01 | ... | 7.4357867241e-02 | 1.7951935530e-02 | 1.3921394013e-02 | 2.1434211731e+01 | 4.5214843750e-01 | 7.7514506876e-02 | 7.1777343750e-02 | 0.0000000000e+00 | 3.5423245430e+00 | 4.0800448507e-02 |
| 7 | 20 | -1.9383698702e-01 | -1.9852678478e-01 | 2.0154602826e-01 | 2.5855624676e-01 | 7.7520370483e-01 | 8.4794059396e-02 | -2.8929358721e-01 | -8.1641042233e-01 | 4.3850939721e-02 | ... | 9.5002755523e-02 | 2.2492416203e-02 | 2.1355332807e-02 | 1.6669036865e+01 | 4.6972656250e-01 | 4.7224905342e-02 | 4.0039062500e-02 | 9.7656250000e-04 | 3.1898307800e+00 | 3.0992921442e-02 |
| 8 | 26 | -6.9953453541e-01 | -6.8415790796e-01 | 4.8824872822e-02 | 4.2658798397e-02 | -8.1896692514e-01 | -9.1712284088e-01 | -9.0183424950e-01 | -6.6844828427e-02 | -2.9103723168e-01 | ... | 1.0371652246e-01 | 2.5541320443e-02 | 2.3846302181e-02 | 4.1645809174e+01 | 2.5048828125e-01 | 1.8387714401e-02 | 1.5625000000e-02 | 0.0000000000e+00 | 4.6905956268e+00 | 1.4598459937e-02 |
| 9 | 30 | -7.2148716450e-01 | -8.4855991602e-01 | 8.9090377092e-01 | 8.8619679213e-02 | -4.4551330805e-01 | -1.2711701393e+00 | -1.2401897907e+00 | -1.3437650204e+00 | -9.0560036898e-01 | ... | 1.4169253409e-01 | 2.0426128060e-02 | 2.5417611003e-02 | 8.1665945053e+00 | 5.4687500000e-01 | 5.4416511208e-02 | 3.6132812500e-02 | 2.4414062500e-03 | 2.2447082996e+00 | 5.2673552185e-02 |
10 rows × 519 columns
genres = pd.read_csv('https://fwadia.blob.core.windows.net/azureml-blobstore-06b00e51-18b9-4fbb-8ab4-b40dab42308f/UI/08-04-2022_022328_UTC/genres.csv?sp=r&st=2022-08-05T01:23:51Z&se=2022-08-08T09:23:51Z&spr=https&sv=2021-06-08&sr=b&sig=2ks%2FlQh1gLMdw79lsgGmpyRzbWbzkXiX21%2FIGSJ3cos%3D')
genres.head(10)
| | genre_id | #tracks | parent | title | top_level |
|---|---|---|---|---|---|
| 0 | 1 | 8693 | 38 | Avant-Garde | 38 |
| 1 | 2 | 5271 | 0 | International | 2 |
| 2 | 3 | 1752 | 0 | Blues | 3 |
| 3 | 4 | 4126 | 0 | Jazz | 4 |
| 4 | 5 | 4106 | 0 | Classical | 5 |
| 5 | 6 | 914 | 38 | Novelty | 38 |
| 6 | 7 | 217 | 20 | Comedy | 20 |
| 7 | 8 | 868 | 0 | Old-Time / Historic | 8 |
| 8 | 9 | 1987 | 0 | Country | 9 |
| 9 | 10 | 13845 | 0 | Pop | 10 |
tracks = pd.read_csv('https://fwadia.blob.core.windows.net/azureml-blobstore-06b00e51-18b9-4fbb-8ab4-b40dab42308f/UI/08-04-2022_022328_UTC/tracks.csv?sp=r&st=2022-08-05T01:28:19Z&se=2022-08-08T09:28:19Z&spr=https&sv=2021-06-08&sr=b&sig=3WPRQ7zcry5EOW6w5RdQZVri0xShN1bGos2o4m3U0jk%3D', low_memory=False) #low_memory=False avoids a DtypeWarning about mixed-type columns
tracks.head(10)
| | Unnamed: 0 | album | album.1 | album.2 | album.3 | album.4 | album.5 | album.6 | album.7 | album.8 | ... | track.10 | track.11 | track.12 | track.13 | track.14 | track.15 | track.16 | track.17 | track.18 | track.19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | comments | date_created | date_released | engineer | favorites | id | information | listens | producer | ... | information | interest | language_code | license | listens | lyricist | number | publisher | tags | title |
| 1 | track_id | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2 | 0 | 2008-11-26 01:44:45 | 2009-01-05 00:00:00 | NaN | 4 | 1 | <p></p> | 6073 | NaN | ... | NaN | 4656 | en | Attribution-NonCommercial-ShareAlike 3.0 Inter... | 1293 | NaN | 3 | NaN | [] | Food |
| 3 | 3 | 0 | 2008-11-26 01:44:45 | 2009-01-05 00:00:00 | NaN | 4 | 1 | <p></p> | 6073 | NaN | ... | NaN | 1470 | en | Attribution-NonCommercial-ShareAlike 3.0 Inter... | 514 | NaN | 4 | NaN | [] | Electric Ave |
| 4 | 5 | 0 | 2008-11-26 01:44:45 | 2009-01-05 00:00:00 | NaN | 4 | 1 | <p></p> | 6073 | NaN | ... | NaN | 1933 | en | Attribution-NonCommercial-ShareAlike 3.0 Inter... | 1151 | NaN | 6 | NaN | [] | This World |
| 5 | 10 | 0 | 2008-11-26 01:45:08 | 2008-02-06 00:00:00 | NaN | 4 | 6 | NaN | 47632 | NaN | ... | NaN | 54881 | en | Attribution-NonCommercial-NoDerivatives (aka M... | 50135 | NaN | 1 | NaN | [] | Freeway |
| 6 | 20 | 0 | 2008-11-26 01:45:05 | 2009-01-06 00:00:00 | NaN | 2 | 4 | <p> "spiritual songs" from Nicky Cook</p> | 2710 | NaN | ... | NaN | 978 | en | Attribution-NonCommercial-NoDerivatives (aka M... | 361 | NaN | 3 | NaN | [] | Spiritual Level |
| 7 | 26 | 0 | 2008-11-26 01:45:05 | 2009-01-06 00:00:00 | NaN | 2 | 4 | <p> "spiritual songs" from Nicky Cook</p> | 2710 | NaN | ... | NaN | 1060 | en | Attribution-NonCommercial-NoDerivatives (aka M... | 193 | NaN | 4 | NaN | [] | Where is your Love? |
| 8 | 30 | 0 | 2008-11-26 01:45:05 | 2009-01-06 00:00:00 | NaN | 2 | 4 | <p> "spiritual songs" from Nicky Cook</p> | 2710 | NaN | ... | NaN | 718 | en | Attribution-NonCommercial-NoDerivatives (aka M... | 612 | NaN | 5 | NaN | [] | Too Happy |
| 9 | 46 | 0 | 2008-11-26 01:45:05 | 2009-01-06 00:00:00 | NaN | 2 | 4 | <p> "spiritual songs" from Nicky Cook</p> | 2710 | NaN | ... | NaN | 252 | en | Attribution-NonCommercial-NoDerivatives (aka M... | 171 | NaN | 8 | NaN | [] | Yosemite |
10 rows × 53 columns
Based on the results above, the dataframe headers need to be cleaned, and attributes that are unlikely to be needed can be removed.
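As an aside, the odd-looking headers are the result of these CSVs storing multi-row headers, which pandas can also parse directly with `header=[...]` and `index_col`. A small synthetic example (not the real FMA file, just the same layout):

```python
import io
import pandas as pd

# Two header rows, an index-name row, then the data rows
csv_text = (
    "feature,chroma,chroma,mfcc\n"
    "statistics,mean,std,mean\n"
    "track_id,,,\n"
    "2,0.1,0.2,0.3\n"
    "3,0.4,0.5,0.6\n"
)
# header=[0, 1] builds a column MultiIndex; index_col=0 uses track_id as the index
df = pd.read_csv(io.StringIO(csv_text), header=[0, 1], index_col=0)
print(df[("mfcc", "mean")].tolist())  # select one (feature, statistic) column
```

The manual cleanup below reaches the same end state, but keeps the intermediate steps explicit.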
Only the acousticness, danceability, energy, instrumentalness, liveness, speechiness, tempo, and valence features are needed from echonest. Descriptions of these features can be found in the Spotify API documentation; they were calculated by The Echo Nest (now part of Spotify) using its own machine learning methods. These features will not be used for classification, but rather to understand, during the data exploration stage, how they are distributed among the genres.
#Echonest cleaning
#Use the 2nd row (the feature names) as the header
echonest.columns = echonest.iloc[1]
echonest = echonest[2:]
echonest.rename(columns={echonest.columns[0]: "track_id"}, inplace = True)
#Remove the blank track_id row
echonest = echonest[1:]
#Keep desired features only
echonest = echonest[['track_id', 'acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'speechiness', 'tempo', 'valence']]
#Ensure data is numeric to help later on
echonest = echonest.apply(pd.to_numeric)
# Verify that dataframe looks right
echonest.head()
| 1 | track_id | acousticness | danceability | energy | instrumentalness | liveness | speechiness | tempo | valence |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 2 | 0.416675 | 0.675894 | 0.634476 | 0.010628 | 0.177647 | 0.159310 | 165.922 | 0.576661 |
| 4 | 3 | 0.374408 | 0.528643 | 0.817461 | 0.001851 | 0.105880 | 0.461818 | 126.957 | 0.269240 |
| 5 | 5 | 0.043567 | 0.745566 | 0.701470 | 0.000697 | 0.373143 | 0.124595 | 100.260 | 0.621661 |
| 6 | 10 | 0.951670 | 0.658179 | 0.924525 | 0.965427 | 0.115474 | 0.032985 | 111.562 | 0.963590 |
| 7 | 134 | 0.452217 | 0.513238 | 0.560410 | 0.019443 | 0.096567 | 0.525519 | 114.290 | 0.894072 |
Only keep the track ID and MFCC features, since the other features are outside the scope of this project. The track IDs are necessary to look up genres. Note that FMA provides the mean, median, minimum, maximum, standard deviation, skew, and kurtosis of the first 20 MFCCs for each track; here, only the mean of each MFCC coefficient will be used.
features.rename(columns={features.columns[0]: "track_id"}, inplace = True)
#Keep track_id and MFCC statistics
cols = ["track_id"]
cols.extend([f for f in features.columns.values if "mfcc" in f])
mfccs = features[cols]
#Only keep means and track_id
cols = mfccs.loc[0] == "mean"
cols["track_id"] = True
mfccs = mfccs.loc[:, cols]
new_headers = ["track_id"]
new_headers.extend(["mfcc." + str(x) for x in range(1, len(mfccs.columns))])
mfccs.columns = new_headers
mfccs = mfccs[3:]
mfccs = mfccs.dropna()
mfccs.head()
| | track_id | mfcc.1 | mfcc.2 | mfcc.3 | mfcc.4 | mfcc.5 | mfcc.6 | mfcc.7 | mfcc.8 | mfcc.9 | ... | mfcc.11 | mfcc.12 | mfcc.13 | mfcc.14 | mfcc.15 | mfcc.16 | mfcc.17 | mfcc.18 | mfcc.19 | mfcc.20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 2 | -1.6377296448e+02 | 1.1669667816e+02 | -4.1753826141e+01 | 2.9144329071e+01 | -1.5050157547e+01 | 1.8879371643e+01 | -8.9181652069e+00 | 1.2002118111e+01 | -4.2531509399e+00 | ... | -2.6829998493e+00 | -7.9463183880e-01 | -6.9209713936e+00 | -3.6553659439e+00 | 1.4652130604e+00 | 2.0107804239e-01 | 3.9982039928e+00 | -2.1146764755e+00 | 1.1684176326e-01 | -5.7858843803e+00 |
| 4 | 3 | -1.5900416565e+02 | 1.2015850067e+02 | -3.3233562469e+01 | 4.7342002869e+01 | -6.2473182678e+00 | 3.1405355453e+01 | -5.2618112564e+00 | 1.1618971825e+01 | -1.5958366394e+00 | ... | -3.4226787090e+00 | 6.9492840767e+00 | -4.1752557755e+00 | -3.5288145542e+00 | 2.7471557260e-01 | -2.2706823349e+00 | 1.0904747248e+00 | -2.3438842297e+00 | 4.7182095051e-01 | -1.5467071533e+00 |
| 5 | 5 | -2.0544049072e+02 | 1.3221507263e+02 | -1.6085823059e+01 | 4.1514759064e+01 | -7.6429538727e+00 | 1.6942802429e+01 | -5.6512613297e+00 | 9.5694446564e+00 | 5.0315696001e-01 | ... | -8.2713766098e+00 | 5.9447294474e-01 | -3.4020280838e-01 | 2.3778877258e+00 | 7.8994874954e+00 | 1.9476414919e+00 | 7.4419503212e+00 | -1.7399110794e+00 | 2.7801498771e-01 | -5.4890155792e+00 |
| 6 | 10 | -1.3586482239e+02 | 1.5704008484e+02 | -5.3453247070e+01 | 1.7198896408e+01 | 6.8680348396e+00 | 1.3934344292e+01 | -1.1749298096e+01 | 8.3607110977e+00 | -5.1303811073e+00 | ... | -5.4212064743e+00 | 1.6794785261e+00 | -6.2182493210e+00 | 1.8441945314e+00 | -4.0997042656e+00 | 7.7994996309e-01 | -5.5957680941e-01 | -1.0183241367e+00 | -3.8075449467e+00 | -6.7953306437e-01 |
| 7 | 20 | -1.3513589478e+02 | 1.1481417847e+02 | 1.2354539871e+01 | 1.9764219284e+01 | 1.8670799255e+01 | 1.9643861771e+01 | 3.5725092888e+00 | 1.2124897003e+01 | -2.2851834297e+00 | ... | -8.0546313524e-01 | 4.0829424858e+00 | 2.1424494684e-01 | 3.8759169579e+00 | -2.3532356322e-01 | 3.9029249549e-01 | -5.7247143984e-01 | 2.7791724205e+00 | 2.4312584400e+00 | 3.0311167240e+00 |
5 rows × 21 columns
The genres dataframe is already clean, but for reference and easier visualization later, it is useful to extract the titles of all the parent genres and top-level genres. These will then inform the decision on which genres to use for classification.
# Get unique parent IDs
parent_ids = set(genres[['parent']].values.reshape(-1))
#Get unique top_level IDs
top_level_ids = set(genres[['top_level']].values.reshape(-1))
# Form dataframes
parent_genres = genres.loc[genres['genre_id'].isin(parent_ids)]
top_level_genres = genres.loc[genres['genre_id'].isin(top_level_ids)]
print("The dataset contains the following", len(genres), "genres: \n")
print(genres[['title']].values.reshape(-1).tolist())
The dataset contains the following 163 genres: ['Avant-Garde', 'International', 'Blues', 'Jazz', 'Classical', 'Novelty', 'Comedy', 'Old-Time / Historic', 'Country', 'Pop', 'Disco', 'Rock', 'Easy Listening', 'Soul-RnB', 'Electronic', 'Sound Effects', 'Folk', 'Soundtrack', 'Funk', 'Spoken', 'Hip-Hop', 'Audio Collage', 'Punk', 'Post-Rock', 'Lo-Fi', 'Field Recordings', 'Metal', 'Noise', 'Psych-Folk', 'Krautrock', 'Jazz: Vocal', 'Experimental', 'Electroacoustic', 'Ambient Electronic', 'Radio Art', 'Loud-Rock', 'Latin America', 'Drone', 'Free-Folk', 'Noise-Rock', 'Psych-Rock', 'Bluegrass', 'Electro-Punk', 'Radio', 'Indie-Rock', 'Industrial', 'No Wave', 'Free-Jazz', 'Experimental Pop', 'French', 'Reggae - Dub', 'Afrobeat', 'Nerdcore', 'Garage', 'Indian', 'New Wave', 'Post-Punk', 'Sludge', 'African', 'Freak-Folk', 'Jazz: Out', 'Progressive', 'Alternative Hip-Hop', 'Death-Metal', 'Middle East', 'Singer-Songwriter', 'Ambient', 'Hardcore', 'Power-Pop', 'Space-Rock', 'Polka', 'Balkan', 'Unclassifiable', 'Europe', 'Americana', 'Spoken Weird', 'Interview', 'Black-Metal', 'Rockabilly', 'Easy Listening: Vocal', 'Brazilian', 'Asia-Far East', 'N. Indian Traditional', 'South Indian Traditional', 'Bollywood', 'Pacific', 'Celtic', 'Be-Bop', 'Big Band/Swing', 'British Folk', 'Techno', 'House', 'Glitch', 'Minimal Electronic', 'Breakcore - Hard', 'Sound Poetry', '20th Century Classical', 'Poetry', 'Talk Radio', 'North African', 'Sound Collage', 'Flamenco', 'IDM', 'Chiptune', 'Musique Concrete', 'Improv', 'New Age', 'Trip-Hop', 'Dance', 'Chip Music', 'Lounge', 'Goth', 'Composed Music', 'Drum & Bass', 'Shoegaze', 'Kid-Friendly', 'Thrash', 'Synth Pop', 'Banter', 'Deep Funk', 'Spoken Word', 'Chill-out', 'Bigbeat', 'Surf', 'Radio Theater', 'Grindcore', 'Rock Opera', 'Opera', 'Chamber Music', 'Choral Music', 'Symphony', 'Minimalism', 'Musical Theater', 'Dubstep', 'Skweee', 'Western Swing', 'Downtempo', 'Cumbia', 'Latin', 'Sound Art', 'Romany (Gypsy)', 'Compilation', 'Rap', 'Breakbeat', 'Gospel', 'Abstract Hip-Hop', 'Reggae - Dancehall', 'Spanish', 'Country & Western', 'Contemporary Classical', 'Wonky', 'Jungle', 'Klezmer', 'Holiday', 'Salsa', 'Nu-Jazz', 'Hip-Hop Beats', 'Modern Jazz', 'Turkish', 'Tango', 'Fado', 'Christmas', 'Instrumental']
print("The dataset contains the following", len(parent_genres), "parent genres: \n")
print(parent_genres[['title']].values.reshape(-1).tolist())
The dataset contains the following 39 parent genres: ['International', 'Blues', 'Jazz', 'Classical', 'Novelty', 'Country', 'Pop', 'Rock', 'Easy Listening', 'Soul-RnB', 'Electronic', 'Sound Effects', 'Folk', 'Soundtrack', 'Funk', 'Spoken', 'Hip-Hop', 'Punk', 'Post-Rock', 'Metal', 'Experimental', 'Loud-Rock', 'Latin America', 'Noise-Rock', 'Radio', 'Reggae - Dub', 'Garage', 'Indian', 'African', 'Middle East', 'Hardcore', 'Europe', 'Techno', 'House', 'Chip Music', 'Dubstep', 'Country & Western', 'Holiday', 'Instrumental']
print("The dataset contains the following", len(top_level_genres), "top-level genres: \n")
print(top_level_genres[['title']].values.reshape(-1).tolist())
The dataset contains the following 16 top-level genres: ['International', 'Blues', 'Jazz', 'Classical', 'Old-Time / Historic', 'Country', 'Pop', 'Rock', 'Easy Listening', 'Soul-RnB', 'Electronic', 'Folk', 'Spoken', 'Hip-Hop', 'Experimental', 'Instrumental']
Looking at the top-level genres, all appear distinct from one another, and they form an appropriately small number of broad categories for differentiating songs. The one exception is the "International" top-level genre, which groups together 7 parent genres that should be considered distinct in their own right. For example, although European and Indian music are both international, they would be expected to sound completely different! To make the model better able to recognize such differences, and so that these unique genres don't get lumped together, the International top-level genre should be replaced by its parent genres. Note that after doing this there will still be an International genre, but genres like Latin America, Middle East, and Indian will be distinct from it.
international_genres_df = parent_genres.loc[parent_genres['top_level'] == 2]
international_genres_df["title"].values.tolist()
['International', 'Latin America', 'Reggae - Dub', 'Indian', 'African', 'Middle East', 'Europe']
# Create dataframe of the genres that will be considered for classification
classification_genres = pd.concat([ top_level_genres[top_level_genres['title'] != "International"],
parent_genres.loc[parent_genres['top_level'] == 2]
], ignore_index=True)
classification_genres
| | genre_id | #tracks | parent | title | top_level |
|---|---|---|---|---|---|
| 0 | 3 | 1752 | 0 | Blues | 3 |
| 1 | 4 | 4126 | 0 | Jazz | 4 |
| 2 | 5 | 4106 | 0 | Classical | 5 |
| 3 | 8 | 868 | 0 | Old-Time / Historic | 8 |
| 4 | 9 | 1987 | 0 | Country | 9 |
| 5 | 10 | 13845 | 0 | Pop | 10 |
| 6 | 12 | 32923 | 0 | Rock | 12 |
| 7 | 13 | 730 | 0 | Easy Listening | 13 |
| 8 | 14 | 1499 | 0 | Soul-RnB | 14 |
| 9 | 15 | 34413 | 0 | Electronic | 15 |
| 10 | 17 | 12706 | 0 | Folk | 17 |
| 11 | 20 | 1876 | 0 | Spoken | 20 |
| 12 | 21 | 8389 | 0 | Hip-Hop | 21 |
| 13 | 38 | 38154 | 0 | Experimental | 38 |
| 14 | 1235 | 14938 | 0 | Instrumental | 1235 |
| 15 | 2 | 5271 | 0 | International | 2 |
| 16 | 46 | 573 | 2 | Latin America | 2 |
| 17 | 79 | 880 | 2 | Reggae - Dub | 2 |
| 18 | 86 | 216 | 2 | Indian | 2 |
| 19 | 92 | 329 | 2 | African | 2 |
| 20 | 102 | 176 | 2 | Middle East | 2 |
| 21 | 130 | 727 | 2 | Europe | 2 |
print("The", len(classification_genres), "genres listed above will be considered for the classification task.")
The 22 genres listed above will be considered for the classification task.
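For later convenience (e.g. turning genre IDs back into readable names), a genre_id → title lookup can be built from classification_genres with dict(zip(...)). Sketched here with a small hand-made dataframe standing in for the real one:

```python
import pandas as pd

# Stand-in for classification_genres, using a few real (genre_id, title) pairs
sample = pd.DataFrame({
    "genre_id": [3, 4, 21],
    "title": ["Blues", "Jazz", "Hip-Hop"],
})
# Map each genre_id to its title
id_to_title = dict(zip(sample["genre_id"], sample["title"]))
print(id_to_title[21])  # → Hip-Hop
```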
Update the column headers for the tracks dataframe and then pre-process to assign genres properly.
pre = [f.split(".")[0] + "_" for f in tracks.columns] # keep the first word as a prefix for the feature (e.g. album, track, artist)
tracks_feature_names = [f for f in tracks.iloc[0]]
columns = [str(pre[i]) + str(tracks_feature_names[i]) for i in range(len(tracks_feature_names))]
columns[0] = "track_id"
tracks.columns = columns
tracks = tracks[2:]
# Print updated column headers for reference
tracks.columns.values
array(['track_id', 'album_comments', 'album_date_created',
'album_date_released', 'album_engineer', 'album_favorites',
'album_id', 'album_information', 'album_listens', 'album_producer',
'album_tags', 'album_title', 'album_tracks', 'album_type',
'artist_active_year_begin', 'artist_active_year_end',
'artist_associated_labels', 'artist_bio', 'artist_comments',
'artist_date_created', 'artist_favorites', 'artist_id',
'artist_latitude', 'artist_location', 'artist_longitude',
'artist_members', 'artist_name', 'artist_related_projects',
'artist_tags', 'artist_website', 'artist_wikipedia_page',
'set_split', 'set_subset', 'track_bit_rate', 'track_comments',
'track_composer', 'track_date_created', 'track_date_recorded',
'track_duration', 'track_favorites', 'track_genre_top',
'track_genres', 'track_genres_all', 'track_information',
'track_interest', 'track_language_code', 'track_license',
'track_listens', 'track_lyricist', 'track_number',
'track_publisher', 'track_tags', 'track_title'], dtype=object)
Based on the feature names above, only track_id, track_genre_top, track_genres, track_genres_all, and track_language_code would be expected to be useful. All other features can be removed.
tracks = tracks[['track_id', 'track_genre_top', 'track_genres', 'track_genres_all', 'track_language_code']]
# Verify that the dataframe looks ok
tracks.head(20)
| | track_id | track_genre_top | track_genres | track_genres_all | track_language_code |
|---|---|---|---|---|---|
| 2 | 2 | Hip-Hop | [21] | [21] | en |
| 3 | 3 | Hip-Hop | [21] | [21] | en |
| 4 | 5 | Hip-Hop | [21] | [21] | en |
| 5 | 10 | Pop | [10] | [10] | en |
| 6 | 20 | NaN | [76, 103] | [17, 10, 76, 103] | en |
| 7 | 26 | NaN | [76, 103] | [17, 10, 76, 103] | en |
| 8 | 30 | NaN | [76, 103] | [17, 10, 76, 103] | en |
| 9 | 46 | NaN | [76, 103] | [17, 10, 76, 103] | en |
| 10 | 48 | NaN | [76, 103] | [17, 10, 76, 103] | en |
| 11 | 134 | Hip-Hop | [21] | [21] | en |
| 12 | 135 | Rock | [45, 58] | [58, 12, 45] | en |
| 13 | 136 | Rock | [45, 58] | [58, 12, 45] | en |
| 14 | 137 | Experimental | [1, 32] | [32, 1, 38] | en |
| 15 | 138 | Experimental | [1, 32] | [32, 1, 38] | en |
| 16 | 139 | Folk | [17] | [17] | en |
| 17 | 140 | Folk | [17] | [17] | en |
| 18 | 141 | Folk | [17] | [17] | en |
| 19 | 142 | Folk | [17] | [17] | en |
| 20 | 144 | Jazz | [4] | [4] | en |
| 21 | 145 | Jazz | [4] | [4] | en |
Based on the above, a track can have multiple genres. However, for simplicity, note that track_genre_top already contains the genres that were selected previously, with two exceptions: the International genre needs to be replaced by one of its lower-level parent genres, and some rows contain NaN values. A NaN in track_genre_top means there is no single clear genre for the track, so assume those rows can be removed to simplify the dataset. Additionally, track_genres and track_genres_all need to be converted from strings to lists to make them possible to work with.
import ast
# Convert strings to list
tracks["track_genres"] = tracks["track_genres"].apply(ast.literal_eval)
tracks["track_genres_all"] = tracks["track_genres_all"].apply(ast.literal_eval)
# Remove rows with track_genre_top = NaN
tracks = tracks[tracks['track_genre_top'].notna()]
# Look at some rows of the International genre before defining logic to classify into the narrower genres
intnl_tracks = tracks[tracks["track_genre_top"] == "International"]
intnl_tracks.head(50)
| | track_id | track_genre_top | track_genres | track_genres_all | track_language_code |
|---|---|---|---|---|---|
| 434 | 666 | International | [79] | [2, 79] | en |
| 435 | 667 | International | [79] | [2, 79] | en |
| 472 | 704 | International | [46] | [2, 46] | es |
| 473 | 705 | International | [46] | [2, 46] | es |
| 474 | 706 | International | [46] | [2, 46] | es |
| 475 | 707 | International | [46] | [2, 46] | es |
| 476 | 708 | International | [46] | [2, 46] | es |
| 477 | 709 | International | [46] | [2, 46] | es |
| 607 | 853 | International | [2] | [2] | en |
| 821 | 1082 | International | [2] | [2] | en |
| 1339 | 1680 | International | [2] | [2] | en |
| 1340 | 1681 | International | [2] | [2] | en |
| 1341 | 1682 | International | [2] | [2] | en |
| 1342 | 1683 | International | [2] | [2] | en |
| 1343 | 1684 | International | [2] | [2] | en |
| 1344 | 1685 | International | [2] | [2] | en |
| 1345 | 1686 | International | [2] | [2] | en |
| 1346 | 1687 | International | [2] | [2] | en |
| 1347 | 1688 | International | [2] | [2] | en |
| 1348 | 1689 | International | [2] | [2] | en |
| 1909 | 3586 | International | [2] | [2] | en |
| 1910 | 3587 | International | [2] | [2] | en |
| 1911 | 3588 | International | [2] | [2] | en |
| 1912 | 3589 | International | [2] | [2] | en |
| 1913 | 3590 | International | [2] | [2] | en |
| 2076 | 3774 | International | [46, 117] | [2, 117, 46] | en |
| 2077 | 3775 | International | [46, 117] | [2, 117, 46] | en |
| 2078 | 3776 | International | [46, 117] | [2, 117, 46] | en |
| 2079 | 3777 | International | [46, 117] | [2, 117, 46] | en |
| 2080 | 3778 | International | [46, 117] | [2, 117, 46] | en |
| 2081 | 3779 | International | [46, 117] | [2, 117, 46] | en |
| 2191 | 3895 | International | [118] | [2, 118] | en |
| 2192 | 3896 | International | [118] | [2, 118] | en |
| 2193 | 3897 | International | [118] | [2, 118] | en |
| 2194 | 3898 | International | [118] | [2, 118] | en |
| 2195 | 3899 | International | [118] | [2, 118] | en |
| 2337 | 4070 | International | [46] | [2, 46] | en |
| 2338 | 4071 | International | [46] | [2, 46] | en |
| 2339 | 4072 | International | [46] | [2, 46] | en |
| 2340 | 4073 | International | [46] | [2, 46] | en |
| 2341 | 4074 | International | [46] | [2, 46] | en |
| 2342 | 4075 | International | [46] | [2, 46] | en |
| 2343 | 4076 | International | [46] | [2, 46] | en |
| 2344 | 4077 | International | [46] | [2, 46] | en |
| 2345 | 4078 | International | [46] | [2, 46] | en |
| 2346 | 4079 | International | [46] | [2, 46] | en |
| 2347 | 4080 | International | [46] | [2, 46] | en |
| 2348 | 4081 | International | [46] | [2, 46] | en |
| 2358 | 4091 | International | [117, 118, 130] | [2, 117, 118, 130] | en |
| 2359 | 4092 | International | [117, 118, 130] | [2, 117, 118, 130] | en |
Based on the above, explode track_genres (track_genres_all is unsuitable because it repeats the International genre ID of 2 in every row). For any exploded row where track_genres != 2, replace track_genre_top with the narrower genre title, if one exists. Then filter out genres that are not being used as classification genres (these show up as NaN values after the mapping). This is acceptable because, after exploding track_genres, one of the values should correspond to the desired parent genre, while the other values correspond to more specific genre titles that are not being used.
Finally, duplicate track IDs should be removed, since they correspond to tracks whose genre is still ambiguous (e.g. a track tagged as both African and Middle Eastern). This ensures each track maps to a single genre, which aids classification.
# Define dictionary of international genres with key as genre id and value as title
international_genres_dict = dict(zip(international_genres_df["genre_id"], international_genres_df["title"]))
print(international_genres_dict)
{2: 'International', 46: 'Latin America', 79: 'Reggae - Dub', 86: 'Indian', 92: 'African', 102: 'Middle East', 130: 'Europe'}
# Explode based on track_genres
intnl_tracks_exploded = intnl_tracks.explode('track_genres')
# Replace track_genre_top with narrower label if available
intnl_tracks_exploded["track_genre_top"] = intnl_tracks_exploded["track_genres"].map(international_genres_dict)
# Remove NaN values arising from mapping (i.e. genre will not be used)
intnl_tracks_exploded = intnl_tracks_exploded[intnl_tracks_exploded['track_genre_top'].notna()]
# Only keep unique track ids
intnl_tracks_exploded = intnl_tracks_exploded.drop_duplicates(subset=['track_id'])
# Verify that all track ids are unique
if not intnl_tracks_exploded.duplicated(subset=['track_id']).any():
    print("No duplicate track ids in the international subset")
No duplicate track ids in the international subset
# Print intnl_tracks_exploded to verify that it looks ok
intnl_tracks_exploded.head(10)
| | track_id | track_genre_top | track_genres | track_genres_all | track_language_code |
|---|---|---|---|---|---|
| 434 | 666 | Reggae - Dub | 79 | [2, 79] | en |
| 435 | 667 | Reggae - Dub | 79 | [2, 79] | en |
| 472 | 704 | Latin America | 46 | [2, 46] | es |
| 473 | 705 | Latin America | 46 | [2, 46] | es |
| 474 | 706 | Latin America | 46 | [2, 46] | es |
| 475 | 707 | Latin America | 46 | [2, 46] | es |
| 476 | 708 | Latin America | 46 | [2, 46] | es |
| 477 | 709 | Latin America | 46 | [2, 46] | es |
| 607 | 853 | International | 2 | [2] | en |
| 821 | 1082 | International | 2 | [2] | en |
Now that the international tracks have been handled, they can be merged with the remaining non-international tracks. In doing the merge, only the track_id and track_genre_top are essential to keep. After that, we will verify that the values of track_genre_top align with the genres chosen for classification.
non_intnl_tracks = tracks[tracks["track_genre_top"] != "International"]
tracks_cleaned = pd.concat([non_intnl_tracks[["track_id", "track_genre_top"]],
intnl_tracks_exploded[["track_id", "track_genre_top"]]],
ignore_index=True)
tracks_cleaned = tracks_cleaned.rename(columns={'track_genre_top': 'genre'})
#Verify that tracks cleaned looks ok
tracks_cleaned.head(20)
| | track_id | genre |
|---|---|---|
| 0 | 2 | Hip-Hop |
| 1 | 3 | Hip-Hop |
| 2 | 5 | Hip-Hop |
| 3 | 10 | Pop |
| 4 | 134 | Hip-Hop |
| 5 | 135 | Rock |
| 6 | 136 | Rock |
| 7 | 137 | Experimental |
| 8 | 138 | Experimental |
| 9 | 139 | Folk |
| 10 | 140 | Folk |
| 11 | 141 | Folk |
| 12 | 142 | Folk |
| 13 | 144 | Jazz |
| 14 | 145 | Jazz |
| 15 | 146 | Jazz |
| 16 | 147 | Jazz |
| 17 | 148 | Experimental |
| 18 | 149 | Experimental |
| 19 | 150 | Experimental |
#Verify that tracks_cleaned does not contain any genres outside of those chosen for classification
classification_genres_set = set(classification_genres['title'].values)
tracks_cleaned_genres_set = set(tracks_cleaned['genre'].values)
if tracks_cleaned_genres_set.issubset(classification_genres_set):
    print("Genres in tracks_cleaned are valid")
Genres in tracks_cleaned are valid
The purpose of this section is to further clean the dataset to simplify visualization and to prepare the data in a form that is suitable for use in a classification or clustering model.
To begin, a consolidated dataset consisting of the tracks and genres selected for analysis, their MFCC features, and the echonest features should be created.
To assist with classification, the MFCC features will also be processed with PCA to see if they can be represented in a reduced form that lowers the dimensionality of the data. Prior to doing this, the features need to be standardized, since PCA is sensitive to differences in feature scale.
# Begin by filtering the mfccs dataframe to only the tracks that are part of tracks_cleaned.
# Drop the tracks with missing MFCC features
# To assist with visualization later, also bring in the echonest metrics for these tracks
dataset = pd.merge(left=tracks_cleaned, right=mfccs, on="track_id", how="left")
dataset = dataset.dropna() # drop tracks with missing MFCC features
dataset = pd.merge(left=dataset, right=echonest, on="track_id", how="left") # bring in echonest features
# Verify that the dataset looks ok
dataset.head(15)
| | track_id | genre | mfcc.1 | mfcc.2 | mfcc.3 | mfcc.4 | mfcc.5 | mfcc.6 | mfcc.7 | mfcc.8 | ... | mfcc.19 | mfcc.20 | acousticness | danceability | energy | instrumentalness | liveness | speechiness | tempo | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Hip-Hop | -1.6377296448e+02 | 1.1669667816e+02 | -4.1753826141e+01 | 2.9144329071e+01 | -1.5050157547e+01 | 1.8879371643e+01 | -8.9181652069e+00 | 1.2002118111e+01 | ... | 1.1684176326e-01 | -5.7858843803e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 3 | Hip-Hop | -1.5900416565e+02 | 1.2015850067e+02 | -3.3233562469e+01 | 4.7342002869e+01 | -6.2473182678e+00 | 3.1405355453e+01 | -5.2618112564e+00 | 1.1618971825e+01 | ... | 4.7182095051e-01 | -1.5467071533e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 5 | Hip-Hop | -2.0544049072e+02 | 1.3221507263e+02 | -1.6085823059e+01 | 4.1514759064e+01 | -7.6429538727e+00 | 1.6942802429e+01 | -5.6512613297e+00 | 9.5694446564e+00 | ... | 2.7801498771e-01 | -5.4890155792e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 10 | Pop | -1.3586482239e+02 | 1.5704008484e+02 | -5.3453247070e+01 | 1.7198896408e+01 | 6.8680348396e+00 | 1.3934344292e+01 | -1.1749298096e+01 | 8.3607110977e+00 | ... | -3.8075449467e+00 | -6.7953306437e-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 134 | Hip-Hop | -2.0766148376e+02 | 1.2552130890e+02 | -3.3416591644e+01 | 3.2260929108e+01 | 8.0747709274e+00 | 1.5349553108e+01 | -4.0741791725e+00 | 1.0281721115e+01 | ... | -7.7040648460e-01 | -3.9955995083e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 135 | Rock | -9.0879714966e+01 | 1.5976300049e+02 | -4.2893623352e+01 | 3.5776615143e+01 | -1.8252986908e+01 | 2.0433145523e+01 | -7.9369482994e+00 | 1.2992751122e+01 | ... | 1.2024710178e+00 | -2.4587099552e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6 | 136 | Rock | -8.4803161621e+01 | 1.4372821045e+02 | -6.7442865372e+00 | 2.5492109299e+01 | 5.0692691803e+00 | 1.6982337952e+01 | -2.4718496799e+00 | 5.3463969231e+00 | ... | -1.7590852976e+00 | 4.0495863557e-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 | 137 | Experimental | -1.2391856384e+02 | 1.5573527527e+02 | -8.0915206909e+01 | 2.7569656372e+01 | 1.0932379723e+01 | 1.9220283508e+01 | -1.8276212692e+01 | -1.8288209915e+01 | ... | -1.0690842867e+00 | 1.7065395117e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | 138 | Experimental | -7.9164611816e+01 | 8.5144149780e+01 | -3.1939628601e+01 | 3.0368049622e+01 | 3.4229247570e+00 | 1.2789516449e+01 | -1.5265024185e+01 | -1.3450626373e+01 | ... | -1.6415536404e+00 | 1.6512305737e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | 139 | Folk | -1.2750725555e+02 | 1.5288587952e+02 | -5.8565074921e+01 | 4.9597194672e+01 | -6.6043100357e+00 | 2.2506578445e+01 | -7.1333312988e+00 | 9.7062120438e+00 | ... | -5.1429504156e-01 | 2.6310398579e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10 | 140 | Folk | -2.2571331787e+02 | 1.3933282471e+02 | -1.3097699165e+01 | 4.4533355713e+01 | 2.4683995247e+00 | 2.8328742981e+01 | -9.9314813614e+00 | 1.0810856819e+01 | ... | -1.5860234201e-01 | 5.9409761429e-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 11 | 141 | Folk | -2.5314390564e+02 | 1.5571632385e+02 | -1.6636627197e+01 | 2.3683815002e+01 | 6.0459570885e+00 | 1.1692952156e+01 | -9.9477605820e+00 | 6.8878135681e+00 | ... | 2.8093214035e+00 | 3.3257400990e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 12 | 142 | Folk | -1.5323365784e+02 | 1.3514985657e+02 | -4.9444625854e+01 | 4.2056404114e+01 | -1.9741883278e+00 | 1.4290251732e+01 | -8.3063608408e-01 | 1.0496274948e+01 | ... | 6.0231194496e+00 | 5.0714325905e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 13 | 144 | Jazz | -1.2692814636e+02 | 1.2631162262e+02 | -3.1843872070e+01 | 4.5561306000e+01 | -3.8294014335e-01 | 1.0514379501e+01 | -1.1236815453e+01 | 7.0307722092e+00 | ... | -1.0537501574e+00 | 1.1049208641e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 14 | 145 | Jazz | -1.3589109802e+02 | 1.2842012024e+02 | -3.3427680969e+01 | 4.3987606049e+01 | -4.1240496635e+00 | 1.7416790009e+01 | -9.8095483780e+00 | 8.7755756378e+00 | ... | 1.0460131168e+00 | -1.6675879955e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
15 rows × 30 columns
import numpy as np
import sklearn as skl
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Standardization and PCA code adapted from https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60
# Standardize the mfcc features of the dataset
X = dataset.loc[:, [c for c in dataset.columns if "mfcc" in c]]
X_std = StandardScaler().fit_transform(X)
# Apply PCA
pca = PCA(n_components=20)
principalComponents = pca.fit_transform(X_std)
explainedVariance = pca.explained_variance_ratio_
cumulativeEV = np.cumsum(explainedVariance)
Begin by looking at how the dataset is distributed among the genres:
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook"
track_counts = dataset.groupby('genre', as_index=False).count().sort_values(by='track_id', ascending=False)
fig = px.bar(track_counts, x='genre', y='track_id', labels = {"genre": "Genre", "track_id":"Count"}, title='Count of Tracks by Genre')
fig.show()
print("There are", len(dataset), "tracks in the dataset")
There are 40129 tracks in the dataset
Based on the above chart, the top 5 genres are rock, experimental, electronic, hip-hop, and folk. Additionally, although the International genre was split into multiple subgenres, classification might not work well for some of them (Europe, African, Middle East, and Indian) simply because they have the fewest tracks.
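To make the imbalance concrete, a useful yardstick is the majority-class baseline: a classifier that always predicts the most common genre. The counts below are hypothetical round numbers for illustration, not the dataset's exact values:

```python
import pandas as pd

# Hypothetical genre counts (illustration only), echoing the
# imbalance visible in the chart above
counts = pd.Series({"Rock": 9000, "Experimental": 7500, "Electronic": 7000,
                    "Hip-Hop": 2500, "Folk": 2000, "Europe": 150,
                    "African": 120, "Middle East": 100, "Indian": 60})

# Accuracy of always predicting the most common genre
baseline_acc = counts.max() / counts.sum()
print(f"Majority-class baseline accuracy: {baseline_acc:.3f}")
```

Any model trained on the full dataset should be judged against this baseline rather than against uniform chance, since the rare international subgenres contribute very little to overall accuracy.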
Create a scree plot to visualize the effectiveness of using different choices on the number of principal components in explaining the MFCC feature variance:
fig = px.line(x=np.arange(1, 21, 1), y=cumulativeEV, labels=dict(x="PC", y="Explained Variance"), markers=True,
color=px.Constant("Cumulative Explained Variance"), title="PCA Scree Plot")
fig.add_bar(x=np.arange(1, 21, 1), y=explainedVariance, name="Explained Variance")
fig.show()
The scree plot shows that just over half of the dataset's variance can be explained by the first 3 PCs. The first 10 PCs explain a bit over 80% of the variance in the MFCCs.
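As a self-contained sketch of how a component count can be chosen from the cumulative explained variance, the snippet below uses synthetic data driven by 3 latent factors (a stand-in for the MFCC matrix, not the actual dataset):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the MFCC matrix: 20 features driven by 3 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 20))

# Standardize, then fit PCA on all components
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=20).fit(X_std)
cum_ev = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components reaching 80% cumulative explained variance
n_80 = int(np.searchsorted(cum_ev, 0.80)) + 1
print("Components needed for 80% explained variance:", n_80)
```

Because the toy data has only 3 strong latent factors, very few components are needed here; on the real MFCCs the same `searchsorted` rule would return about 10, consistent with the scree plot above.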
For the next visualizations, plot the first 2 and first 3 PCs labelled by genre to see if any clusters are apparent. All plots in the remainder of this section can be filtered as needed by clicking on the series in the legend. Double-clicking a series isolates it from all others, whereas single-clicking toggles it on/off.
PCs_df = pd.DataFrame(data=principalComponents[:, :3], columns=["PC1", "PC2", "PC3"])
tracks_and_genres = dataset[["track_id", "genre"]]
PCs_df = pd.concat([PCs_df.reset_index(drop=True), tracks_and_genres.reset_index(drop=True)], axis=1)
fig = px.scatter(PCs_df, x="PC1", y="PC2", color="genre", symbol="genre", title="Top 2 MFCC Principal Components by Genre")
fig.show()
fig = px.scatter_3d(PCs_df, x="PC1", y="PC2", z="PC3", color='genre')
fig.update_layout(margin=dict(l=10, r=10, b=10, t=40), title='Top 3 MFCC Principal Components by Genre')
fig.update_scenes(xaxis_autorange="reversed", yaxis_autorange="reversed")
fig.show()
Looking at the principal components of MFCCs by genre plots, there are regions where certain genres are more likely to be, but there do not appear to be clear cluster boundaries or consistency in the shapes of the clusters for each genre. Some general observations on distinguishing between the genres are provided below:
For the final visualizations, the echonest features will be examined to understand how music genres vary in qualities such as acousticness, danceability, energy, instrumentalness, liveness, speechiness, tempo, and valence. Although these features will not be used for classification, looking at them at this stage should provide clearer insight into how the genres differ. Note that not all genres will be present in each chart due to missing data in the echonest features.
# Calculate averages of echonest features and drop genres with missing info
echonest_averages = dataset.groupby('genre', as_index=False).agg({"acousticness": "mean",
"danceability":"mean",
"energy":"mean",
"instrumentalness":"mean",
"liveness":"mean",
"speechiness":"mean",
"tempo":"mean",
"valence":"mean"})
echonest_averages = echonest_averages.dropna()
echonest_averages
| | genre | acousticness | danceability | energy | instrumentalness | liveness | speechiness | tempo | valence |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Blues | 0.883875 | 0.485561 | 0.390235 | 0.183444 | 0.122580 | 0.037252 | 119.513600 | 0.505450 |
| 2 | Classical | 0.986494 | 0.318438 | 0.054674 | 0.715784 | 0.218769 | 0.059107 | 99.240562 | 0.239131 |
| 5 | Electronic | 0.280795 | 0.590354 | 0.637690 | 0.749167 | 0.168776 | 0.097246 | 125.108284 | 0.431572 |
| 7 | Experimental | 0.599967 | 0.573539 | 0.430228 | 0.506951 | 0.164140 | 0.091255 | 123.640412 | 0.617081 |
| 8 | Folk | 0.747726 | 0.465819 | 0.343948 | 0.543683 | 0.154615 | 0.057183 | 118.175960 | 0.346635 |
| 9 | Hip-Hop | 0.363376 | 0.646300 | 0.586759 | 0.337519 | 0.188337 | 0.254353 | 117.676421 | 0.595177 |
| 11 | Instrumental | 0.577847 | 0.498469 | 0.502319 | 0.531401 | 0.202884 | 0.098989 | 114.778583 | 0.427666 |
| 12 | International | 0.788740 | 0.530034 | 0.424812 | 0.567316 | 0.188242 | 0.170889 | 125.104935 | 0.641766 |
| 13 | Jazz | 0.755227 | 0.383942 | 0.328465 | 0.702578 | 0.171593 | 0.082645 | 109.688016 | 0.287190 |
| 16 | Old-Time / Historic | 0.963293 | 0.503834 | 0.252559 | 0.637894 | 0.331007 | 0.142882 | 118.115613 | 0.560074 |
| 17 | Pop | 0.482880 | 0.574947 | 0.489717 | 0.375738 | 0.156410 | 0.061806 | 121.014888 | 0.433490 |
| 19 | Rock | 0.382597 | 0.393707 | 0.664871 | 0.600970 | 0.193606 | 0.064910 | 126.942851 | 0.413129 |
# Plot the averages for each genre
# Note that tempo should be considered in isolation because its values are not on the same 0-1 scale as the others
fig = px.line(echonest_averages, x="genre",
y=['acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'speechiness', 'tempo', 'valence'],
title="Averages of Echonest Features by Genre")
# Hide tempo by default so that it doesn't skew the axes
for trace in fig['data']:
    if trace['name'] in ["tempo"]:
        trace['visible'] = 'legendonly'
fig.show()
# Obtain the standard deviations of each series above. Higher standard deviation would indicate more variance between genres for a given feature.
echonest_averages.std(numeric_only=True).sort_values(ascending=False)
tempo               7.716330
acousticness        0.239611
energy              0.171386
instrumentalness    0.168242
valence             0.129432
danceability        0.095593
speechiness         0.061185
liveness            0.051566
dtype: float64
Looking at the results above, tempo, acousticness, energy, and instrumentalness have the most variance. Some observations from the plots are noted below:
Given the results above, it would also be interesting to see how tempo, acousticness, energy, and instrumentalness are distributed among the genres as compared to looking at their averages, especially considering that they have the highest variability. To do this, use a KDE estimate rather than histograms (so the diagrams don't get too cluttered) and plot the probability density (i.e. area under each curve should be 1).
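Before the plotting helper, the density-normalization property (area under each KDE curve equals 1) can be checked in isolation with scipy's gaussian_kde on a synthetic bimodal sample; the values below are illustrative, not the dataset's tempo data:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic bimodal sample standing in for a tempo distribution
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(95, 10, 500), rng.normal(180, 10, 100)])

# Fit a Gaussian kernel density estimate and evaluate it on a fine grid
kde = gaussian_kde(sample)
grid = np.linspace(sample.min() - 30, sample.max() + 30, 2000)
density = kde(grid)

# Riemann-sum approximation of the area under the KDE curve
area = float(np.sum(density) * (grid[1] - grid[0]))
print(f"Area under KDE: {area:.3f}")
```

The area comes out very close to 1, which is why probability densities from different genres can be compared on the same axes even when the genres have very different track counts.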
import math
import plotly.figure_factory as ff
# Helper function for plotting feature distributions for genres with available data
def plotFeatureDistribution(dataset, feature_name):
    f_name = feature_name.title()
    # Get a dictionary with genre as key and the list of feature values (e.g. acousticness) as value
    by_genre_dict = dataset.groupby("genre")[feature_name].apply(list).to_dict()
    # Form the dictionary into a list of lists, removing NaNs from each inner list
    by_genre = [[x for x in by_genre_dict[k] if not math.isnan(x)] for k in by_genre_dict.keys()]
    # Remove genres whose list became empty and keep track of which genres are available
    genres_available = []
    by_genre_available = []
    for i, genre in enumerate(by_genre_dict.keys()):
        if by_genre[i]:  # List is not empty
            by_genre_available.append(by_genre[i])
            genres_available.append(genre)
    fig = ff.create_distplot(by_genre_available, genres_available, show_hist=False)
    fig.update_layout(title_text='Distribution of ' + f_name + ' by Genre',
                      xaxis=dict(title=f_name), yaxis=dict(title="Probability Density"))
    fig.show()
plotFeatureDistribution(dataset, 'tempo')
For tempo, hip-hop and pop seem to have the lowest variance. However, hip-hop has a bimodal distribution, with the main peak around 95 BPM and a smaller peak around 180 BPM. This could reflect hip-hop consisting of multiple styles, some faster than others (e.g. rap music, which is also a subgenre of hip-hop in this dataset). Rock appears to have high variance in tempo, while blues typically has faster tempos (lower variance than rock, appearing towards the right of the figure).
plotFeatureDistribution(dataset, 'acousticness')
Classical clearly appears to have the lowest variance and is clustered around a score of 0.99, which is very similar to the average result in the Average of Echonest Features by Genre plot. Filtering that out by clicking on it in the legend and looking at the remaining curves, blues and old-time/historic have low variance, but are skewed to the right. Electronic and rock generally have low acousticness, while folk and jazz have high acousticness. Instrumental, experimental, and pop have fairly uniform distributions of acousticness.
plotFeatureDistribution(dataset, 'energy')
Classical has the lowest energy and variance. The variances for old-time/historic and blues are also generally lower than for the other genres, with old-time/historic peaking around 0.2 and blues peaking around 0.4. Rock and electronic have fairly uniform distributions, but concentrated slightly to the high side.
plotFeatureDistribution(dataset, 'instrumentalness')
Electronic, interestingly, is considered to have high instrumentalness, and also appears to have the lowest variance, followed by rock. Distributions for hip-hop and pop are skewed to the left on the other hand, and experimental appears to have the most uniform distribution of instrumentalness. Experimental also had a fairly uniform distribution for acousticness, which makes sense given the nature of this genre as discussed earlier in the comments for the PCA plots.
For genre classification based on the MFCC features, the 2 models that will be developed are a k-nearest neighbours (KNN) model and a random forest model. The KNN model was chosen for its simplicity of implementation, and because the PCA visualizations suggest there are particular regions where certain genres lie, which helps distinguish them from other genres. For example, considering hip-hop, folk, and old-time/historic in isolation on the PCA plots, each of these genres does seem to occupy a distinct region. However, given the relatively large number of genres in the classification list, there are also many regions of overlap with no clear distinction. By using a KNN model, the hope is that choosing the majority class among the k nearest neighbours will lead to good results.
A random forest was chosen as the second model to reduce the overfitting of a single decision tree while remaining relatively simple to implement. Although a non-linear model such as a feedforward neural network would likely provide more accurate results, the goal with these models is to first see if simpler models suffice. As a future extension to this project, classification with a neural network could be interesting to implement.
Note that hyperparameter tuning will be done for both models within this same section.
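As a minimal sketch of the random forest setup described above (on synthetic data from make_classification rather than the PCA-reduced MFCCs, and with placeholder hyperparameters rather than the tuned values):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the PCA-reduced MFCC features
X, y = make_classification(n_samples=1000, n_features=10, n_informative=6,
                           n_classes=4, random_state=21)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=21)

# An ensemble of decision trees; n_estimators and max_depth are the usual knobs to tune
rf = RandomForestClassifier(n_estimators=100, random_state=21)
rf.fit(X_tr, y_tr)
acc = rf.score(X_te, y_te)
print(f"Test accuracy: {acc:.3f}")
```

The same fit/score pattern carries over to the real features; only the hyperparameter grid and the train/validation/test splits defined below differ.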
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
#Split the data into training, validation, and test sets 70-15-15
X_train_val, X_test, y_train_val, y_test = train_test_split(dataset.loc[:, [c for c in dataset.columns if "mfcc" in c]], dataset["genre"], test_size=0.15, random_state=21)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=15/85, random_state=21)
#Standardize the data, fitting the scaler on the training set only to avoid information leakage
scaler = StandardScaler()
scaler.fit(X_train)
X_train_std = scaler.transform(X_train)
X_val_std = scaler.transform(X_val)
X_test_std = scaler.transform(X_test)
#Apply PCA on the MFCC features
pca = PCA(n_components=20)
pca.fit(X_train_std)
X_train_pca = pca.transform(X_train_std)
X_val_pca = pca.transform(X_val_std)
X_test_pca = pca.transform(X_test_std)
Train a KNN model and vary k and the number of PCA components to use in the model. Keep the model that performs best on the validation set.
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
bestValAccuracy = 0
bestK = -1
bestNumFeatures = -1
for k in range(15, 25):
    for n_components in range(4, 11):
        #Fit to training data
        KNNmodel = KNeighborsClassifier(n_neighbors=k)
        KNNmodel.fit(X_train_pca[:, :n_components], y_train)
        y_pred_train = KNNmodel.predict(X_train_pca[:, :n_components])
        y_pred_val = KNNmodel.predict(X_val_pca[:, :n_components])
        train_acc = metrics.accuracy_score(y_train, y_pred_train)
        val_acc = metrics.accuracy_score(y_val, y_pred_val)
        if val_acc > bestValAccuracy:
            bestValAccuracy = val_acc
            bestK = k
            bestNumFeatures = n_components
        print("k:", k, "| Number of components:", n_components, "| Training Accuracy:", train_acc, "| Validation Accuracy:", val_acc)
print("\nThe best model on the validation set used k =", bestK, "and", bestNumFeatures, "PCA components. The validation set accuracy is", bestValAccuracy)
k: 15 | Number of components: 4 | Training Accuracy: 0.5075652390615544 | Validation Accuracy: 0.45348837209302323
k: 15 | Number of components: 5 | Training Accuracy: 0.5331268468083591 | Validation Accuracy: 0.4818936877076412
k: 15 | Number of components: 6 | Training Accuracy: 0.538360212182705 | Validation Accuracy: 0.48322259136212625
k: 15 | Number of components: 7 | Training Accuracy: 0.5460500551817438 | Validation Accuracy: 0.5006644518272425
k: 15 | Number of components: 8 | Training Accuracy: 0.5548435330556446 | Validation Accuracy: 0.5034883720930232
k: 15 | Number of components: 9 | Training Accuracy: 0.5600768984299904 | Validation Accuracy: 0.5129568106312292
k: 15 | Number of components: 10 | Training Accuracy: 0.5658442806792695 | Validation Accuracy: 0.5129568106312292
k: 16 | Number of components: 4 | Training Accuracy: 0.5053223681868347 | Validation Accuracy: 0.4569767441860465
k: 16 | Number of components: 5 | Training Accuracy: 0.5284987005589377 | Validation Accuracy: 0.48438538205980064
k: 16 | Number of components: 6 | Training Accuracy: 0.5352629143080921 | Validation Accuracy: 0.4840531561461794
k: 16 | Number of components: 7 | Training Accuracy: 0.5454448360568194 | Validation Accuracy: 0.5024916943521595
k: 16 | Number of components: 8 | Training Accuracy: 0.5515326284310584 | Validation Accuracy: 0.5029900332225914
k: 16 | Number of components: 9 | Training Accuracy: 0.5588664601801417 | Validation Accuracy: 0.5093023255813953
k: 16 | Number of components: 10 | Training Accuracy: 0.5646338424294207 | Validation Accuracy: 0.5144518272425249
k: 17 | Number of components: 4 | Training Accuracy: 0.5020826658122397 | Validation Accuracy: 0.45614617940199337
k: 17 | Number of components: 5 | Training Accuracy: 0.5279290825590088 | Validation Accuracy: 0.4845514950166113
k: 17 | Number of components: 6 | Training Accuracy: 0.5318452063085194 | Validation Accuracy: 0.48754152823920266
k: 17 | Number of components: 7 | Training Accuracy: 0.5438783865570151 | Validation Accuracy: 0.5029900332225914
k: 17 | Number of components: 8 | Training Accuracy: 0.5500373811812453 | Validation Accuracy: 0.5026578073089701
k: 17 | Number of components: 9 | Training Accuracy: 0.5576916230552885 | Validation Accuracy: 0.5089700996677741
k: 17 | Number of components: 10 | Training Accuracy: 0.5592224714300972 | Validation Accuracy: 0.507641196013289
k: 18 | Number of components: 4 | Training Accuracy: 0.5019402613122574 | Validation Accuracy: 0.4558139534883721
k: 18 | Number of components: 5 | Training Accuracy: 0.5261846274342269 | Validation Accuracy: 0.48322259136212625
k: 18 | Number of components: 6 | Training Accuracy: 0.530919577058635 | Validation Accuracy: 0.4885382059800664
k: 18 | Number of components: 7 | Training Accuracy: 0.5407810886824024 | Validation Accuracy: 0.5038205980066445
k: 18 | Number of components: 8 | Training Accuracy: 0.5483641283064545 | Validation Accuracy: 0.5051495016611296
k: 18 | Number of components: 9 | Training Accuracy: 0.5542027128057246 | Validation Accuracy: 0.5074750830564784
k: 18 | Number of components: 10 | Training Accuracy: 0.5569083983053864 | Validation Accuracy: 0.5083056478405316
k: 19 | Number of components: 4 | Training Accuracy: 0.4995193848125601 | Validation Accuracy: 0.45714285714285713
k: 19 | Number of components: 5 | Training Accuracy: 0.5238705543095162 | Validation Accuracy: 0.48255813953488375
k: 19 | Number of components: 6 | Training Accuracy: 0.5309551781836306 | Validation Accuracy: 0.48803986710963454
k: 19 | Number of components: 7 | Training Accuracy: 0.538431414432696 | Validation Accuracy: 0.5043189368770764
k: 19 | Number of components: 8 | Training Accuracy: 0.546584072056677 | Validation Accuracy: 0.5006644518272425
k: 19 | Number of components: 9 | Training Accuracy: 0.5516750329310406 | Validation Accuracy: 0.5088039867109635
k: 19 | Number of components: 10 | Training Accuracy: 0.556231976930471 | Validation Accuracy: 0.5109634551495017
k: 20 | Number of components: 4 | Training Accuracy: 0.4988785645626402 | Validation Accuracy: 0.45880398671096345
k: 20 | Number of components: 5 | Training Accuracy: 0.5228025205596497 | Validation Accuracy: 0.48438538205980064
k: 20 | Number of components: 6 | Training Accuracy: 0.531239987183595 | Validation Accuracy: 0.4883720930232558
k: 20 | Number of components: 7 | Training Accuracy: 0.5372565773078429 | Validation Accuracy: 0.5044850498338871
k: 20 | Number of components: 8 | Training Accuracy: 0.544768414681904 | Validation Accuracy: 0.5016611295681063
k: 20 | Number of components: 9 | Training Accuracy: 0.5507850048061519 | Validation Accuracy: 0.5101328903654485
k: 20 | Number of components: 10 | Training Accuracy: 0.5547011285556623 | Validation Accuracy: 0.5104651162790698
k: 21 | Number of components: 4 | Training Accuracy: 0.49862935668767133 | Validation Accuracy: 0.4589700996677741
k: 21 | Number of components: 5 | Training Accuracy: 0.5200612339349924 | Validation Accuracy: 0.48272425249169437
k: 21 | Number of components: 6 | Training Accuracy: 0.528712307308911 | Validation Accuracy: 0.4877076411960133
k: 21 | Number of components: 7 | Training Accuracy: 0.5366157570579231 | Validation Accuracy: 0.5021594684385382
k: 21 | Number of components: 8 | Training Accuracy: 0.5438783865570151 | Validation Accuracy: 0.49966777408637875
k: 21 | Number of components: 9 | Training Accuracy: 0.5493965609313254 | Validation Accuracy: 0.5073089700996678
k: 21 | Number of components: 10 | Training Accuracy: 0.550856207056143 | Validation Accuracy: 0.5098006644518273
k: 22 | Number of components: 4 | Training Accuracy: 0.4978461319377692 | Validation Accuracy: 0.46146179401993354
k: 22 | Number of components: 5 | Training Accuracy: 0.5192068069350991 | Validation Accuracy: 0.4850498338870432
k: 22 | Number of components: 6 | Training Accuracy: 0.5273594645590801 | Validation Accuracy: 0.48471760797342195
k: 22 | Number of components: 7 | Training Accuracy: 0.5344084873081989 | Validation Accuracy: 0.5014950166112957
k: 22 | Number of components: 8 | Training Accuracy: 0.5425255438071843 | Validation Accuracy: 0.5021594684385382
k: 22 | Number of components: 9 | Training Accuracy: 0.547687706931539 | Validation Accuracy: 0.5086378737541528
k: 22 | Number of components: 10 | Training Accuracy: 0.5496457688062942 | Validation Accuracy: 0.5101328903654485
k: 23 | Number of components: 4 | Training Accuracy: 0.4949624408131297 | Validation Accuracy: 0.46312292358803986
k: 23 | Number of components: 5 | Training Accuracy: 0.519064402435117 | Validation Accuracy: 0.48754152823920266
k: 23 | Number of components: 6 | Training Accuracy: 0.5257930150592759 | Validation Accuracy: 0.4885382059800664
k: 23 | Number of components: 7 | Training Accuracy: 0.5326996333084125 | Validation Accuracy: 0.5014950166112957
k: 23 | Number of components: 8 | Training Accuracy: 0.5421339314322332 | Validation Accuracy: 0.5041528239202658
k: 23 | Number of components: 9 | Training Accuracy: 0.5456584428067927 | Validation Accuracy: 0.5073089700996678
k: 23 | Number of components: 10 | Training Accuracy: 0.5505001958061875 | Validation Accuracy: 0.5117940199335548
k: 24 | Number of components: 4 | Training Accuracy: 0.4946064295631742 | Validation Accuracy: 0.4632890365448505
k: 24 | Number of components: 5 | Training Accuracy: 0.5176403574352949 | Validation Accuracy: 0.4878737541528239
k: 24 | Number of components: 6 | Training Accuracy: 0.5250097903093738 | Validation Accuracy: 0.48903654485049836
k: 24 | Number of components: 7 | Training Accuracy: 0.5310263804336217 | Validation Accuracy: 0.4978405315614618
k: 24 | Number of components: 8 | Training Accuracy: 0.5395706504325537 | Validation Accuracy: 0.5039867109634552
k: 24 | Number of components: 9 | Training Accuracy: 0.5444124034319484 | Validation Accuracy: 0.506312292358804
k: 24 | Number of components: 10 | Training Accuracy: 0.5492185553063477 | Validation Accuracy: 0.5119601328903655

The best model on the validation set used k = 16 and 10 PCA components. The validation set accuracy is 0.5144518272425249
# Plot a confusion matrix for the classification counts in the validation set
from sklearn.metrics import confusion_matrix
import plotly.express as px  # used for px.imshow below (plotly.figure_factory was imported but unused)
# Recreate best model and obtain validation set predictions
bestKNNModel = KNeighborsClassifier(n_neighbors=bestK)
bestKNNModel.fit(X_train_pca[:, :bestNumFeatures], y_train)
y_pred_val = bestKNNModel.predict(X_val_pca[:, :bestNumFeatures])
lbls = classification_genres["title"].values.tolist()
z = confusion_matrix(y_val, y_pred_val, labels=lbls)
fig = px.imshow(z, x=lbls, y=lbls, color_continuous_scale='Viridis', aspect="auto")
fig.update_traces(text=z, texttemplate="%{text}")
fig.update_xaxes(side="top", title="Predicted Class")
fig.update_yaxes(title="Actual Class")
fig.show()
The KNN model with the highest accuracy on the validation set used k=16 and the first 10 PCA components of the MFCC features; validation set accuracy was 51.4% for this model, and training set accuracy was 56.5%. This indicates that the model has been fit reasonably and is not overfit. Although an accuracy of 51.4% may seem low, it is important to keep in mind that since there are 22 genres, a random classifier would be expected to have an accuracy, on average, of 1/22 = 4.5%. With this relatively simple KNN model, validation set accuracy is more than 10x that of a baseline, random classifier identifying genres by chance.
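The 1/22 baseline figure can be checked directly. Below is a minimal sketch (synthetic features and labels, not this notebook's data) using scikit-learn's `DummyClassifier` to confirm that a uniform random classifier over 22 classes sits near 4.5% accuracy:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)
n_classes = 22
X = rng.normal(size=(2200, 10))            # placeholder features
y = rng.integers(0, n_classes, size=2200)  # labels drawn uniformly over 22 "genres"

# A uniform random classifier's expected accuracy is 1/22
baseline = DummyClassifier(strategy="uniform", random_state=0)
baseline.fit(X, y)
acc = baseline.score(X, y)
print(round(1 / n_classes, 3))  # theoretical baseline, prints 0.045
```

The measured `acc` fluctuates around the theoretical 1/22 with a standard error of roughly 0.4 percentage points at this sample size.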
The following observations are made based on the confusion matrix:
To try a model that is slightly more complex than the KNN, but not as complex as a feedforward neural network, a random forest model will be developed. The hyperparameters to be tuned are the criterion used to measure split quality, the minimum number of samples required to split an internal node, and the number of PCA components used.
Note that choosing between 4 and 10 PCA components has been implemented for both models in order to reduce dimensionality of the dataset (as compared to using all 20 features).
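The 4-to-10 component range can be motivated by looking at cumulative explained variance. A minimal sketch on synthetic 20-dimensional data (variable names are illustrative stand-ins, not this notebook's objects):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in for 20 correlated, standardized MFCC features
latent = rng.normal(size=(1000, 5))
mixing = rng.normal(size=(5, 20))
X = latent @ mixing + 0.1 * rng.normal(size=(1000, 20))

X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)
cumvar = np.cumsum(pca.explained_variance_ratio_)
for k in (4, 6, 8, 10):
    print(f"{k} components explain {cumvar[k - 1]:.1%} of total variance")
```

When the cumulative curve flattens out by 10 components, the remaining components add little and can be dropped with minimal information loss.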
from sklearn.ensemble import RandomForestClassifier

bestValAccuracy = 0
bestCrit = ""
bestMinSamplesSplit = -1
bestNumFeatures = -1
for crit in ["gini", "entropy"]:
    for mss in range(10, 100, 10):
        for n_components in range(4, 11, 2):
            RFmodel = RandomForestClassifier(n_estimators=100, criterion=crit,
                                             min_samples_split=mss, random_state=21)
            RFmodel.fit(X_train_pca[:, :n_components], y_train)
            y_pred_train = RFmodel.predict(X_train_pca[:, :n_components])
            y_pred_val = RFmodel.predict(X_val_pca[:, :n_components])
            train_acc = metrics.accuracy_score(y_train, y_pred_train)
            val_acc = metrics.accuracy_score(y_val, y_pred_val)
            if val_acc > bestValAccuracy:
                bestValAccuracy = val_acc
                bestCrit = crit
                bestMinSamplesSplit = mss
                bestNumFeatures = n_components
            print("Criterion:", crit, "| Number of components:", n_components,
                  "| min_samples_split =", mss, "| Training Accuracy:", train_acc,
                  "| Validation Accuracy:", val_acc)
print("\nThe best model on the validation set used", bestCrit, "criterion, min_samples_split =",
      bestMinSamplesSplit, "and", bestNumFeatures, "PCA components. The validation set accuracy is",
      bestValAccuracy)
[Grid-search output truncated: training and validation accuracies for the gini and entropy criteria, min_samples_split from 10 to 90, and 4 to 10 PCA components.]
The best model on the validation set used gini criterion, min_samples_split = 10 and 10 PCA components. The validation set accuracy is 0.5337209302325582
# Plot a confusion matrix for the classification counts in the validation set
# Recreate best model and obtain validation set predictions
bestRFModel = RandomForestClassifier(n_estimators=100, criterion=bestCrit, min_samples_split = bestMinSamplesSplit, random_state=21)
bestRFModel.fit(X_train_pca[:, :bestNumFeatures], y_train)
y_pred_val = bestRFModel.predict(X_val_pca[:, :bestNumFeatures])
lbls = classification_genres["title"].values.tolist()
z = confusion_matrix(y_val, y_pred_val, labels=lbls)
fig = px.imshow(z, x=lbls, y=lbls, color_continuous_scale='Viridis', aspect="auto")
fig.update_traces(text=z, texttemplate="%{text}")
fig.update_xaxes(side="top", title="Predicted Class")
fig.update_yaxes(title="Actual Class")
fig.show()
The random forest model with the highest accuracy on the validation set used the gini criterion, min_samples_split=10, and the first 10 PCA components of the MFCC features. Validation set accuracy was 53.4%, and training accuracy was 89.7%. Although the validation accuracy is about 2 percentage points higher here than for the KNN, the large gap between training and validation accuracy shows that this model is much more overfit to the training data. Therefore, the KNN model is the better choice for testing, since it is less likely to be overfit.
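The overfitting pattern seen here is typical: small min_samples_split values let the trees grow deep enough to memorize the training set. A minimal sketch on synthetic data (dataset and parameter values are illustrative, not this notebook's) showing the train/validation gap shrink as min_samples_split increases:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 4-class problem as a stand-in for the genre data
X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

gaps = {}
for mss in (2, 10, 50):
    rf = RandomForestClassifier(n_estimators=100, min_samples_split=mss,
                                random_state=0).fit(X_tr, y_tr)
    # Train-minus-validation accuracy: a rough proxy for overfitting
    gaps[mss] = rf.score(X_tr, y_tr) - rf.score(X_va, y_va)
    print(f"min_samples_split={mss}: train-val gap = {gaps[mss]:.3f}")
```

Raising min_samples_split acts as a regularizer, trading a little training accuracy for a smaller generalization gap, which matches the sweep results above.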
The following observations are made based on the confusion matrix:
Based on the observations made above, the KNN model will be evaluated on the test set.
# The best KNN model used the first 10 PCA components; bestNumFeatures was later
# overwritten by the random forest search, so 10 is hardcoded here
y_pred_test = bestKNNModel.predict(X_test_pca[:, :10])
test_acc = metrics.accuracy_score(y_test, y_pred_test)
print("The accuracy of the KNN model on the test set is", test_acc)
The accuracy of the KNN model on the test set is 0.5102990033222591
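Beyond overall accuracy, per-class precision and recall show which genres are hardest to classify. A minimal, hypothetical sketch using `sklearn.metrics.classification_report` on synthetic labels (the genre names and noise model are illustrative, not this notebook's variables):

```python
import numpy as np
from sklearn.metrics import classification_report

rng = np.random.default_rng(7)
genres = ["Rock", "Electronic", "Folk"]  # illustrative subset, not the full 22
y_true = rng.choice(genres, size=300)
# Keep the true label half the time; otherwise predict a random genre
y_hat = np.where(rng.random(300) < 0.5, y_true, rng.choice(genres, size=300))

# Per-class precision/recall reveals which genres drive the overall accuracy
print(classification_report(y_true, y_hat, labels=genres))
```

Applied to `y_test` and `y_pred_test`, the same call would complement the confusion matrix below with per-genre summary statistics.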
lbls = classification_genres["title"].values.tolist()
z = confusion_matrix(y_test, y_pred_test, labels=lbls)
fig = px.imshow(z, x=lbls, y=lbls, color_continuous_scale='Viridis', aspect="auto")
fig.update_traces(text=z, texttemplate="%{text}")
fig.update_xaxes(side="top", title="Predicted Class")
fig.update_yaxes(title="Actual Class")
fig.show()
The accuracy on the test set is 51.0%, which is very similar to the validation set accuracy for this model (51.4%). This confirms that the model is not overfit, and although seemingly low, the result is still acceptable given the large number of classification genres: it is more than 10x the expected accuracy of a classifier choosing at random.
The following observations are made based on the confusion matrix:
For visualization purposes, the plots below show the first 2 and first 3 PCA components of the MFCC features, segmented by actual vs. predicted class. These plots can be filtered as desired to gain insight into some of the challenges of separating genres. For example, comparing correctly classified Rock tracks with Experimental tracks predicted to be Rock, no clear-cut decision boundary is apparent.
PCs_df_test = pd.DataFrame(data=X_test_pca[:, :3], columns=["PC1", "PC2", "PC3"])
actual_and_predictions_test = pd.DataFrame(data=np.c_[y_test, y_pred_test], columns=["Actual Class", "Predicted Class"])
PCs_df_test = pd.concat([PCs_df_test.reset_index(drop=True), actual_and_predictions_test.reset_index(drop=True)], axis=1)
fig = px.scatter(PCs_df_test, x="PC1", y="PC2", color="Actual Class", symbol="Predicted Class", title="Top 2 MFCC Principal Components of Test Set Predictions")
fig.show()
fig = px.scatter_3d(PCs_df_test, x="PC1", y="PC2", z="PC3", color="Actual Class", symbol="Predicted Class")
fig.update_layout(margin=dict(l=10, r=10, b=10, t=40), title='Top 3 MFCC Principal Components of Test Set Predictions')
fig.update_scenes(xaxis_autorange="reversed", yaxis_autorange="reversed")
fig.show()